
    Gapped consensus motif discovery: evaluation of a new algorithm based on local multiple alignments and a sampling strategy

    We check the efficiency and feasibility of a novel method designed for the discovery of a priori unknown motifs described as gaps alternating with specific regions. Such motifs are searched for as consensuses of non-homologous biological sequences. The only specifications required concern the maximal gap length, the minimal frequency for specific characters and the minimal percentage (quorum) of sequences sharing the motif. Our method is based on a cooperation between a multiple alignment method, for quick detection of local similarities, and a sampling strategy that runs candidate position-specific scoring matrices to convergence. This rather original way of converging to the solution proves efficient on simulated data, gapped instances of the so-called challenge problem, promoter sites in dicot plants and transcription factor binding sites in E. coli. Our algorithm compares favorably with the MEME and STARS approaches in terms of accuracy.

    A hierarchical Bayesian network approach for linkage disequilibrium modeling and data-dimensionality reduction prior to genome-wide association studies

    Background: Discovering the genetic basis of common genetic diseases in the human genome represents a public health issue. However, the dimensionality of the genetic data (up to 1 million genetic markers) and its complexity make the statistical analysis a challenging task. Results: We present an accurate modeling of dependences between genetic markers, based on a forest of hierarchical latent class models, which is a particular class of probabilistic graphical models. This model offers an adapted framework to deal with the fuzzy nature of linkage disequilibrium blocks. In addition, the data dimensionality can be reduced through the latent variables of the model, which synthesize the information borne by genetic markers. In order to tackle the learning of both forest structure and probability distributions, a generic algorithm has been proposed. A first implementation of our algorithm has been shown to be tractable on benchmarks describing 10^5 variables for 2000 individuals. Conclusions: The forest of hierarchical latent class models offers several advantages for genome-wide association studies: accurate modeling of linkage disequilibrium, flexible data dimensionality reduction and biological meaning borne by latent variables.

    Large-scale computational and statistical analyses of high transcription potentialities in 32 prokaryotic genomes

    This article compares 32 bacterial genomes with respect to their high transcription potentialities. The σ70 promoter has been widely studied in the Escherichia coli model and a consensus is known. Since transcriptional regulations are known to compensate for promoter weakness (i.e. when the promoter similarity with regard to the consensus is rather low), predicting functional promoters is a hard task. Instead, the research work presented here investigates potentially high ORF expression in relation to three criteria: (i) high similarity to the σ70 consensus (namely, the consensus variant appropriate for each genome), (ii) transcription strength reinforcement through a supplementary binding site, the upstream promoter (UP) element, and (iii) enhancement through an optimal Shine-Dalgarno (SD) sequence. We show that in the AT-rich Firmicutes genomes, frequencies of potentially strong σ70-like promoters are exceptionally high. Besides, though they contain a low number of strong promoters (SPs), some genomes may show a high proportion of promoters harbouring an UP element. Putative SPs of lesser quality are more frequently associated with an UP element than putative SPs of better quality. A meaningful difference is statistically ascertained when comparing bacterial genomes with similarly AT-rich genomes generated at random; the difference is the highest for Firmicutes. Comparing some Firmicutes genomes with similarly AT-rich Proteobacteria genomes, we confirm the Firmicutes specificity. We show that this specificity is explained neither by AT bias nor by genome size bias; nor does it originate in the abundance of optimal SD sequences, a typical and significant feature of Firmicutes more thoroughly analysed in our study.

    A Bayesian network approach to model local dependencies among SNPs

    In this preliminary work, we investigate a method to model linkage disequilibrium among SNPs (single nucleotide polymorphisms) in the genome. Genetic data such as SNPs are characterized by a typical block-like structure along the genome. Graphical models such as Bayesian networks can provide a fine and biologically relevant modeling of dependencies for both haplotypic and genotypic SNP data. We applied an MWST-based (Maximum Weighted Spanning Tree) algorithm to construct a Bayesian network, relying on the underlying local dependencies.
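    As a rough illustration of the MWST idea, the sketch below builds a maximum weighted spanning tree over a few SNP columns. It is not the authors' implementation: weighting edges by empirical mutual information (as in the classical Chow-Liu procedure), the function names and the toy genotype data are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def max_weight_spanning_tree(columns):
    """Kruskal's algorithm on the complete graph weighted by mutual information."""
    m = len(columns)
    # Heaviest edges first, so the greedy pass yields a maximum-weight tree.
    edges = sorted(
        ((mutual_information(columns[i], columns[j]), i, j)
         for i, j in combinations(range(m), 2)),
        reverse=True,
    )
    parent = list(range(m))  # union-find forest used to detect cycles

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it creates no cycle
            parent[ri] = rj
            tree.append((i, j, w))
    return tree


# Hypothetical toy genotypes (0/1/2 minor-allele counts), one list per SNP.
snps = [
    [0, 0, 1, 1, 2, 2, 0, 1],  # SNP 0
    [0, 0, 1, 1, 2, 2, 0, 1],  # SNP 1: identical to SNP 0
    [0, 1, 0, 1, 0, 1, 0, 1],  # SNP 2
    [1, 0, 1, 0, 1, 0, 1, 0],  # SNP 3: fully determined by SNP 2
]
tree = max_weight_spanning_tree(snps)
for i, j, w in tree:
    print(f"SNP {i} - SNP {j}: MI = {w:.3f}")
```

    Orienting the edges of such a tree away from an arbitrarily chosen root gives a tree-shaped Bayesian network in which every SNP has at most one parent.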

    Alternative Methods for H1 Simulations in Genome Wide Association Studies

    Assessing the statistical power to detect susceptibility variants plays a critical role in GWA studies, from both the prospective and retrospective points of view. Power is empirically estimated by simulating phenotypes under a disease model H1. For this purpose, the gold standard consists of simulating genotypes given the phenotypes (e.g. Hapgen). We introduce here an alternative approach for simulating phenotypes under H1 that does not require generating new genotypes for each simulation. In order to simulate phenotypes with a fixed total number of cases and under a given disease model, we suggest three algorithms: i) a simple rejection algorithm; ii) a numerical Markov Chain Monte Carlo (MCMC) approach; and iii) an exact and efficient backward sampling algorithm. In our study, we validated the three algorithms both on a toy dataset and by comparing them with Hapgen on a more realistic dataset. As an application, we then conducted a simulation study on a 1000 Genomes Project dataset consisting of 629 individuals (314 cases) and 8,048 SNPs from Chromosome X. We arbitrarily defined an additive disease model with two susceptibility SNPs and an epistatic effect. The three algorithms are consistent, but backward sampling is dramatically faster than the other two. Our approach also gives results consistent with Hapgen. Using our application data, we showed that our limited design requires biological prior knowledge to limit the investigated region. We also showed that epistatic effects can play a significant role even when simple marker statistics (e.g. trend) are used. We finally showed that the overall performance of a GWA study strongly depends on the prevalence of the disease: the larger the prevalence, the better the power.
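    The simplest of the three algorithms can be sketched as follows, assuming that "rejection" means drawing each phenotype independently from its disease probability and discarding any draw whose number of cases differs from the target. The logistic model, its coefficients, the function names and the toy genotypes are hypothetical choices made for this example (and, unlike the model in the abstract, there is no epistatic term); they are not taken from the paper.

```python
import math
import random


def disease_probability(genotypes, intercept, effects):
    """P(case) for one individual under a hypothetical additive logistic model."""
    eta = intercept + sum(b * g for b, g in zip(effects, genotypes))
    return 1.0 / (1.0 + math.exp(-eta))


def simulate_phenotypes_rejection(probs, n_cases, rng, max_tries=100_000):
    """Draw independent Bernoulli phenotypes; accept only draws with n_cases cases."""
    for _ in range(max_tries):
        y = [1 if rng.random() < p else 0 for p in probs]
        if sum(y) == n_cases:
            return y
    raise RuntimeError("no accepted draw; the acceptance rate is too low")


rng = random.Random(0)
# Toy data: 20 individuals, two susceptibility SNPs coded 0/1/2 (assumed values).
genotypes = [[rng.choice([0, 1, 2]), rng.choice([0, 1, 2])] for _ in range(20)]
probs = [disease_probability(g, intercept=-1.0, effects=[0.5, 0.3]) for g in genotypes]
phenotypes = simulate_phenotypes_rejection(probs, n_cases=10, rng=rng)
print(phenotypes)
```

    The acceptance rate of such a scheme falls as the sample grows or as the target number of cases moves away from the expected number, which is one plausible reason a faster exact sampler is attractive.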

    Visualization of Pairwise and Multilocus Linkage Disequilibrium Structure Using Latent Forests

    The study of linkage disequilibrium is a major issue in statistical genetics, as it plays a fundamental role in gene mapping and helps us to learn more about human history. The complex structure of linkage disequilibrium makes exploratory data analysis essential yet challenging. Visualization methods, such as the triangular heat map implemented in Haploview, provide simple and useful tools to help understand complex genetic patterns, but remain insufficient to fully describe them. Probabilistic graphical models have been widely recognized as a powerful formalism allowing a concise and accurate modeling of dependences between variables. In this paper, we propose a method for short-range, long-range and chromosome-wide linkage disequilibrium visualization using forests of hierarchical latent class models. Thanks to its hierarchical nature, our method is shown to provide the geneticist with a compact view of both pairwise and multilocus linkage disequilibrium spatial structures. Besides, a multilocus linkage disequilibrium measure has been designed to evaluate linkage disequilibrium in hierarchy clusters. To learn the proposed model, a new scalable algorithm is presented. It constrains the dependence scope, relying on physical positions, and is able to deal with more than one hundred thousand single nucleotide polymorphisms. The proposed algorithm is fast and does not require phased genotypic data.

    Pharmacogenetic analysis of high-dose methotrexate treatment in children with osteosarcoma.

    Inter-individual differences in toxic symptoms and pharmacokinetics of high-dose methotrexate (MTX) treatment may be caused by genetic variants in the MTX pathway. Correlations between polymorphisms and pharmacokinetic parameters and the occurrence of hepato- and myelotoxicity were studied. Single nucleotide polymorphisms (SNPs) of the ABCB1, ABCC1, ABCC2, ABCC3, ABCC10, ABCG2, GGH, SLC19A1 and NR1I2 genes were analyzed in 59 patients with osteosarcoma. Univariate association analysis and Bayesian network-based Bayesian univariate and multilevel analysis of relevance (BN-BMLA) were applied. Rare alleles of 10 SNPs of ABCB1, ABCC2, ABCC3, ABCG2 and NR1I2 genes showed a correlation with the pharmacokinetic values and univariate association analysis. The risk of toxicity was associated with five SNPs in the ABCC2 and NR1I2 genes. Pharmacokinetic parameters were associated with four SNPs of the ABCB1, ABCC3, NR1I2, and GGH genes, and toxicity was shown to be associated with ABCC1 rs246219 and ABCC2 rs717620 using the univariate and BN-BMLA methods. BN-BMLA analysis detected relevant effects on the AUC0-48 for the following SNPs: ABCB1 rs928256, ABCC3 rs4793665, and GGH rs3758149. In both univariate and multivariate analyses, the SNPs ABCB1 rs928256, ABCC3 rs4793665, GGH rs3758149, and NR1I2 rs3814058 were relevant. These SNPs should be considered in future dose individualization during treatment.

    Mobile Regulatory Cassettes Mediate Modular Shuffling in T4-Type Phage Genomes

    Coliphage phi1, which was isolated for phage therapy in the Republic of Georgia, is closely related to the T-like myovirus RB49. The ∼275 open reading frames encoded by each phage have an average level of amino acid identity of 95.8%. RB49 lacks 7 phi1 genes while 10 RB49 genes are missing from phi1. Most of these unique genes encode functions without known homologs. Many of the insertion, deletion, and replacement events that distinguish the two phages are in the hyperplastic regions (HPRs) of their genomes. The HPRs are rich in both nonessential genes and small regulatory cassettes (promoter-early stem-loops [PeSLs]) composed of strong σ70-like promoters and stem-loop structures, which are effective transcription terminators. Modular shuffling mediated by recombination between PeSLs has caused much of the sequence divergence between RB49 and phi1. We show that exchanges between nearby PeSLs can also create small circular DNAs that are apparently encapsidated by the virus. Such PeSL “mini-circles” may be important vectors for horizontal gene transfer.